Blind Hash Compression

ABSTRACT

Techniques are provided for blind hash compression, such as serving, from a computer server system and to a plurality of different computing devices remote from the computer server system, web code and code for reporting status of the computing devices; receiving from one or more of the computing devices, first data that indicates a parameter of the one or more computing devices, the first data in a compressed format; receiving from one or more others of the computing devices, second data that indicates the parameter of the one or more others of the computing devices, the second data in an uncompressed format; and compressing the second data and comparing the compressed second data to the first data to correlate the first data to the second data. The code for reporting status of the computing devices can include code for allowing the computing devices to determine whether to send the first or second data.

CROSS-REFERENCE TO RELATED APPLICATIONS; BENEFIT CLAIM

This application claims the benefit under 35 U.S.C. § 120 as a Continuation of U.S. patent application No. 14/980,231, filed on 2015-12-28, which is a Continuation of U.S. patent application No. 14/160,107, filed on 2014-01-21, the entire contents of which are hereby incorporated by reference as if fully set forth herein.

FIELD OF THE DISCLOSURE

This document generally relates to computer communications.

BACKGROUND

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.

Web content, such as HTML or JavaScript for generating web pages, may contain application-like functionality that is interpreted and executed within a visitor's browser, or in a similar application. The general goal with HTML and other web technologies is to make them work, and work similarly, across many different platforms (e.g., Mac, PC, Linux, etc.).

To maximize the functionality of web content, it can be relevant for a system that serves the content to know the configurations of computers (whether desktop, smartphone, tablet, or other) that are being served the content. For example, particular knowledge can be obtained by identifying the type of browser that is rendering a web page, the operating system on which the browser is running, and plug ins that might also be operating on such computers. However, this additional supporting information must generally be sent from the various client computers to the server system, and such transmission adds overhead to the functioning of a browser presenting a web page or other application, which overhead is not directly responsible for improving operation of the page.

SUMMARY

This document describes systems and techniques by which various user computing devices (computers such as desktops, laptops, tablets, and smartphones) can submit information to a server system in a manner that lowers the bandwidth required for such reporting. Specifically, certain of the computing devices can send information in a lossy compressed format (e.g., as a hash of the original information), while others can send the same information in an uncompressed format (e.g., as the original plaintext).

The compressed format may be highly compressed, such as by a lossy one-way function so that the server system cannot immediately determine what original string a compressed submission is indicative of (e.g., via a hash function or other lossy compression function).

To determine what the compressed submissions represent, the server system compresses any received uncompressed submissions (or submissions made with lossless compression) using the same technique used by the client devices to perform their compression, at which point the server system knows the correlation between the uncompressed and compressed representations, and can then correlate any previously- or later-received compressed representations back to the original raw data. The percentage of the client computers reporting raw data may be much smaller than those reporting compressed data, so that the overall bandwidth of the system is substantially reduced. For example, each of the computing devices may determine whether it should submit a compressed representation of the data, or instead, an uncompressed representation by generating a random number (e.g., using standard JavaScript functions), and only send a particular format or representation if the generated number is above or below a predetermined number, as the case may be.

The server system may provide a biasing value to the computing devices when it serves web code so as to push the random number higher or lower, so as to affect the likelihood that any particular computing device will send uncompressed, raw data instead of compressed data. More frequent submission of uncompressed representations will allow a server system to more quickly identify the real meaning of data that newly arrives, e.g., when new features arrive on the computing devices (e.g., new plug ins are announced), but could cause higher bandwidth usage in a pool of computing devices. Thus, an operator of a server system may use the biasing value to match its desire for fast reaction versus its desire for lower bandwidth requirements.

To further minimize the amount of data transfer needed, the compression algorithm may be one that is available from public libraries, such as standard JavaScript hash algorithms. In this manner, the server system may automatically obtain plaintext representations of new data as it arrives in a pool of computers (e.g., all computers trying to access a particular retailer's web site), but may also determine how broadly such information has spread without having to send the potentially voluminous plaintext representation for very many of the computing devices.

Generally, hashing algorithms are selective enough that very few collisions will be seen between hashes (i.e., two different strings of text sent by computing devices will seldom generate the same hash value). When there are collisions, however, a server system will not be able to determine what is meant by such a compressed value when it arrives (it will be ambiguous as between the two or more source strings that generate the compressed value). Thus, the system just discussed may also include provisions for resolving such collisions. For example, a computing device may perform a secondary compression that uses a different algorithm than the primary compression, so that if the values of both compressions do not match across different submissions, then the source text for those different submissions is known to be different. Alternatively, or in addition, a length of the source string may also be submitted to serve as yet another separate check on the source string.

In particular implementations of such techniques, the collected data may be configuration data for the computing devices, which may include, for example, the make and model of the computer, the make and version of the operating system and the web browser that is being used, the identity of active plug ins and other applications currently executing on the computing device in addition to the browser, among other things, such as installed fonts, screen resolution, etc. Collected data may also include activity data that identifies actions that have been taken on the computer, including actions by third-party software that appear to be anomalous (e.g., attempts to interact with the revised web code in an invalid manner). Such data may be collected by one or more central server systems for diagnostics purposes, including for identifying the state of machines when a program throws an error, and for identifying common characteristics of computing devices that are exhibiting fraudulent or other anomalous behavior. For example, a criminal group may have a plug in or other software surreptitiously distributed to thousands of computers spread across the world to form a so-called bot net, and the server system discussed here may use reporting information from such computers to more quickly and accurately identify the presence of a new bot net that is emerging, and the behavior of that bot net (e.g., if common reports of malicious activity are coming from a particular operating system running a particular browser version).

Various implementations are described herein using hardware, software, firmware, or a combination of such components. In some implementations, a computer-implemented method can include serving, from a computer server system and to a plurality of different computing devices remote from the computer server system, web code and code for reporting parameters of the computing devices; receiving from different ones of the computing devices, a plaintext representation of a particular parameter of a first of the computing devices, and a hashed representation of the same parameter of a second of the computing devices; hashing the plaintext representation of the particular parameter to create a hash value, and comparing the hash value to the hashed representation; and based on a determination that the hash value matches the hashed representation, correlating the hashed representation to the plaintext representation on the computer server system, wherein the code for reporting parameters of the computing devices includes code for allowing the computing devices to determine whether to send a plaintext representation or a hashed representation.

These and other implementations can optionally include one or more of the following features. The code for allowing the computing devices to determine whether to send a plaintext representation or a hashed representation can include biasing data that affects a frequency with which the computing devices select to send the plaintext representation or the hashed representation.

The method can further include receiving from the computing devices, plaintext representations and hashed representations of a plurality of different parameters of the computing devices; hashing the received plaintext representations to create hashed values; and using correlations between the hashed values and the received plaintext representations to identify parameters represented by the hashed representations. The method can further include using the hashed representation and the plaintext representation to identify characteristics of malware executing on the computing devices.

In some implementations, a computer-implemented method can include serving, from a computer server system and to a plurality of different computing devices remote from the computer server system, web code and code for reporting status of the computing devices; receiving from one or more of the computing devices, first data that indicates a parameter of the one or more computing devices, the first data in a compressed format; receiving from one or more others of the computing devices, second data that indicates the parameter of the one or more others of the computing devices, the second data in an uncompressed format; and compressing the second data and comparing the compressed second data to the first data to correlate the first data to the second data, wherein the code for reporting status of the computing devices includes code for allowing the computing devices to determine whether to send the first data or the second data.

These and other implementations can optionally include one or more of the following features. The code for allowing the computing devices to determine whether to send the first data or the second data can include biasing data that affects a frequency with which the computing devices select to send the first data or the second data. The first data can be compressed on the computing devices using hashing. The server system can be configured to not send hashing algorithm information to the computing devices. The method can further include using the compressed format to represent the parameter in identifying aggregate activity by multiple of the computing devices. The method can further include determining from the aggregate activity by multiple of the computing devices whether ones of the multiple computing devices are infected with malware. The computer server system can be an intermediary security server system that is separate from a web server system that generates and serves the web code. The method can further include comparing information sent with the compressed second data to information derived from the received first data to determine whether the compressed second data was generated from data that matches the first data.

In some implementations, one or more non-transitory storage devices can store instructions that, when executed by one or more computer processors, perform operations comprising: serving, from a computer server system and to a plurality of different computing devices remote from the computer server system, web code and code for reporting status of the computing devices; receiving from one or more of the computing devices, first data that indicates a parameter of the one or more computing devices, the first data in a compressed format; receiving from one or more others of the computing devices, second data that indicates the parameter of the one or more others of the computing devices, the second data in an uncompressed format; and compressing the second data and comparing the compressed second data to the first data to correlate the first data to the second data, wherein the code for reporting status of the computing devices includes code for allowing the computing devices to determine whether to send the first data or the second data.

These and other implementations can optionally include one or more of the following features. The code for allowing the computing devices to determine whether to send the first data or the second data can include biasing data that affects a frequency with which the computing devices select to send the first data or the second data. The first data can be compressed on the computing devices using hashing. The operations can further include using the compressed format to represent the parameter in identifying aggregate activity by multiple of the computing devices. The operations can further include determining from the aggregate activity by multiple of the computing devices whether ones of the multiple computing devices are infected with malware. The computer server system can include an intermediary security server system that is separate from a web server system that generates and serves the web code. The operations can further include comparing information sent with the compressed second data to information derived from the received first data to determine whether the compressed second data was generated from data that matches the first data.

In some implementations, a computer-implemented system includes: a first data communication interface arranged to communicate with a web server system; a second data communication interface arranged to communicate with clients that request content from the web server system; a compressed code interpreter programmed to identify an original form of compressed content received from particular ones of the clients by (a) compressing original content received from other ones of the clients to form a compressed representation, and (b) comparing the compressed representation to the compressed content received from the particular ones of the clients, wherein the compressed code interpreter compresses the original content using a technique that matches techniques used by the particular ones of the clients to compress the content.

These and other implementations can optionally include one or more of the following features. The system can be further programmed to provide code to the clients that allows the clients to determine whether to provide compressed content or instead, uncompressed content to the system.

In some implementations, a computer-implemented method can include serving, from a computer server system and to a plurality of different computing devices remote from the computer server system, web code and code for reporting parameters of the computing devices; receiving from different ones of the computing devices, a plaintext representation of a particular parameter of a first of the computing devices, and a hashed representation of the same parameter of a second of the computing devices; hashing the plaintext representation of the particular parameter to create a hash value, and comparing the hash value to the hashed representation; and based on a determination that the hash value matches the hashed representation, correlating the hashed representation to the plaintext representation on the computer server system, wherein the code for reporting parameters of the computing devices includes code for allowing the computing devices to determine whether to send a plaintext representation or a hashed representation.

The features discussed here may, in certain implementations, provide one or more advantages. For example, a security intermediary system may be provided that does not add an appreciable level of bandwidth to the communication channel between a server system and the clients it services. The intermediary system may collect data that is relatively large compared to the bandwidth that it occupies, and may use that data for diagnosing problems with particular clients, and across large numbers of clients (e.g., by identifying the spread of malware threats). Moreover, a wide variety of data for various purposes may be transmitted using these techniques, and may be used for a wide variety of purposes once it is interpreted at the server system. Moreover, in certain implementations, the compressed representations can be used as database keys, thus further simplifying the operations recited herein.

Other features and advantages will be apparent from the description and drawings, and from the claims.

The appended claims may serve as a summary of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

In the drawings:

FIG. 1 is a schematic diagram of a system for providing compressed reporting of computing device information using a blind hash.

FIG. 2 is a schematic diagram of a system for performing deflection and detection of malicious activity with respect to a web server system.

FIG. 3 is a flow chart of a process for reducing bandwidth requirements between computers.

FIG. 4 is a swim lane diagram of a process for transferring data between client computers and a server system.

FIG. 5A is a representation of a state machine for client-side encoding.

FIG. 5B is a representation of a state machine for server-side decoding.

FIG. 6 is a block diagram of a generic computer system for implementing the processes and systems described herein.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

In the following description, for the purpose of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

It will be further understood that: the term “or” may be inclusive or exclusive unless expressly stated otherwise; the term “set” may comprise zero, one, or two or more elements; the terms “first”, “second”, “certain”, and “particular” are used as naming conventions to distinguish elements from each other and do not imply an ordering, timing, or any other characteristic of the referenced items unless otherwise specified; the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items; and the terms “comprises” and/or “comprising” specify the presence of stated features, but do not preclude the presence or addition of one or more other features.

General Overview

This document discusses mechanisms for reducing bandwidth between client computing devices and server systems with which they communicate (where “clients” and “servers” are terms used generally, and do not require any sort of formal client-server architecture). Generally, the mechanisms are most useful where many different computing devices will be communicating the same data to the server system. For example, it may be beneficial to have computing devices report their configuration information to a server system so that the system can identify commonality in the operations of such devices, for example, to diagnose reasons for faults in the devices or to identify the emergence of malware on the devices in a large group of devices (e.g., all devices that access a banking or retail web site).

The common data that is communicated may be communicated by some of the computing devices in its native form (e.g., plaintext) or another form in which its content can be directly determined (e.g., via lossless compression or encryption for which the server system receiving the data can accurately decompress or decrypt the data).

Others of the devices may communicate the same data in a form from which it cannot be identified directly, such as by submitting a hash of the data. When the server system receives compressed representations of the text but has not yet received the original representation, it can save indications of the compressed representations in association with the computing devices from which they were received, without knowing the original representation. When the server system receives any uncompressed representations, it can compress them using the same algorithm that the client devices used, can store the correlation of the compressed representation to the original representation, and can use that correlation to resolve any compressed representations, whether associated with events reported from computing devices in the past or the future, to determine what the compressed representation actually represents.
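By way of illustration only, the server-side correlation just described might be sketched as follows. The class name BlindHashResolver, its method names, and the choice of SHA-256 as the shared hash are assumptions made for this sketch and are not taken from the described system; any hash that the client devices and the server system apply identically would serve.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Optional


def compress(text: str) -> str:
    """Hash applied identically by clients and server (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class BlindHashResolver:
    """Hypothetical server-side store that maps hashes back to plaintext."""

    def __init__(self) -> None:
        self.known: Dict[str, str] = {}  # hash -> original text
        self.pending: Dict[str, List[str]] = defaultdict(list)  # hash -> waiting device ids

    def receive_hashed(self, device_id: str, digest: str) -> Optional[str]:
        """Record a hashed report; return the plaintext if it is already known."""
        if digest in self.known:
            return self.known[digest]
        # Original text not seen yet: remember which device sent this hash.
        self.pending[digest].append(device_id)
        return None

    def receive_plaintext(self, device_id: str, text: str) -> List[str]:
        """Hash a plaintext report, store the correlation, and return the
        devices whose earlier hashed reports are now resolved."""
        digest = compress(text)
        self.known[digest] = text
        return self.pending.pop(digest, [])
```

In this sketch a hashed report that arrives before any matching plaintext is simply held against the reporting device, and a later plaintext report resolves every held report with the same hash at once.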

Some or all compressed representations may be accompanied by a secondary representation that can be used to identify potential collisions between the compressed representations. In particular, because the compressed representations are smaller in size than the uncompressed representations, certain compressed representations will end up being repeated in a system, so that two identical compressed representations received by a server system could represent different original strings.

Though proper selection of parameters will make such collisions relatively rare, where the volume of the different strings that need to be represented is extensive, the risk of a collision may be relevant. The secondary representation, then, may serve as a check on the main representation, as it will be extremely unlikely that both would match even though the original text did not. Such secondary representation may be transmitted to the server system with the compressed representation, and may be formed, for example, by applying a second hash or other compression technique to the original text that uses a different algorithm, or by sending a value that represents a length of the original string.
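As a non-limiting sketch of the secondary check, the following example pairs a deliberately small primary hash with the length of the original string and treats a primary match with a mismatched length as a collision. The function names, the 8-bit primary hash (the first byte of a SHA-256 digest, chosen only so that collisions are easy to produce), and the status strings are assumptions made for illustration.

```python
import hashlib
from typing import Dict, Optional, Tuple


def primary(text: str) -> int:
    """Deliberately tiny 8-bit primary hash (illustrative assumption)."""
    return hashlib.sha256(text.encode("utf-8")).digest()[0]


def secondary(text: str) -> int:
    """Secondary representation: the length of the original string."""
    return len(text)


def learn(known: Dict[int, Tuple[str, int]], text: str) -> None:
    """Store a plaintext report under its primary hash with its secondary value."""
    known[primary(text)] = (text, secondary(text))


def resolve(known: Dict[int, Tuple[str, int]],
            primary_value: int,
            secondary_value: int) -> Tuple[str, Optional[str]]:
    """Classify a compressed report as 'unknown', 'collision', or 'match'."""
    entry = known.get(primary_value)
    if entry is None:
        return ("unknown", None)
    text, expected_secondary = entry
    if expected_secondary != secondary_value:
        # Same primary hash but different length: the source text must differ.
        return ("collision", None)
    return ("match", text)
```

A length check of this kind cannot separate two colliding strings of equal length, which is one reason a second hash using a different algorithm may be preferred where collisions are a real concern.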

The compressed representations or other representations that correspond to the compressed representations may then be passed as identifiers for the original data to systems that can perform analysis using such data. For example, client devices may pass reports that indicate anomalous activity, such as efforts by a browser plug-in to access served code using defunct function names or the like (e.g., in a system that uses a security intermediary to change the function names with each serving of the web code).

A fraud detection system may perform clustering analysis on the reported features of such computing devices, and may use the compressed representations as identifiers for the various reported features in performing such analysis. The analysis may be used to identify devices having particular characteristics (e.g., IP address, operating system, and browser) that have reported the existence of anomalous behavior, which may in turn be used to determine whether the anomalous behavior is benign (e.g., from a plug in that users intentionally installed) or malicious (e.g., code performing a “Man in the Middle” attack on their devices).

FIG. 1 is a schematic diagram of a system 100 for providing compressed reporting of computing device information using a blind hash. In general, the system 100 is directed to presenting information from a web server system 108 to a variety of computing devices 102A-C that are located remotely from the web server system 108.

Examples of operators of such a web server system 108 include on-line retailers and on-line banking systems, where the devices 102A-C belong to people trying to buy products or perform on-line banking transactions. The web server system 108 is shown as a row of servers along with a separate row of servers for a security server system 106, both in a single data center facility. Such arrangement is intended to indicate that, in one typical implementation, an operator of a web site may supplement its main server system 108 with a security server system 106 that it builds itself or that it acquires from a third party.

The security server system 106 may sit physically and logically between the web server system 108 and the network, which may include internet 104, and may intercept web code to be served to the various client devices 102A-C.

In the described example, the system 100 operates by providing modified or recoded web code to the client computing device 102, where the modifications are relative to a web page that would normally be served to the client computing device without additional security measures applied. Web code may include, for example, HTML, CSS, JavaScript, and other program code associated with the content or transmission of web resources such as a web page that may be presented at a client computing device 102 (e.g., via a web browser or a native application (non-browser)).

The system 100 can detect and obstruct attempts by fraudsters and computer hackers to learn the structure of a website (e.g., the operational design of the pages for a site) and exploit security vulnerabilities in the client device 102. For example, malware may infect the client device 102 and gather sensitive information about a user of the device, or deceive a user into engaging in compromising activity such as divulging confidential information. Man-in-the-middle exploits are performed by one type of malware that is difficult to detect on a client device 102, but can use security vulnerabilities at the client device 102 to engage in such malicious activity.

Served code 110 shows an example of code that can be served to a requesting one of various of the computing devices 102A-C after the request is provided to the web server system 108, the content from the web server system 108 is intercepted or otherwise provided to the security server system 106, and the code is changed and/or supplemented by the security server system 106. Various portions of the served code 110 are shown schematically to indicate actions that the security server system 106 can take with respect to the code.

Code 110A represents the original web code provided by the web server system 108 with certain modifications made to it. For example, the security server system 106 may change the names of functions in essentially random ways every time a set of content for a web page is served, where the changes are made consistently across the served code so as not to break internal references between pieces of the code. For example, references to a particular function may be made consistently across HTML, CSS, and JavaScript. For example, the following strings indicate HTML before and after alteration using a random number for textual replacement:

Original code:

<form action="login.jsp" method="post" name="Login">
<input type="text" id="lastname_id" name="lastname"

Re-coded format:

<form action="login.jsp" method="post" name="imp0q6wNm">
<input type="text" id="b24mpqdfKX" name="aSkFjp5x1Y"

Such changes may be made so that malware on a client device that receives the code cannot easily identify the operational structure of the web site and/or automatically interact with the code so as to mislead a user into opening its security to the malware (e.g., for a Man in the Middle attack). By making the changes frequently enough and randomly enough that automated malware cannot interact with it predictably, the security server system 106 interferes with such attacks by malware.
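The consistent renaming described above might be sketched, purely as an illustration, by generating one random replacement per original name and applying the same mapping everywhere that name appears. The helper names random_name and recode, the ten-character name length, and the use of whole-word matching are assumptions for this sketch; a real recoder would presumably parse HTML, CSS, and JavaScript rather than rely on pattern matching.

```python
import re
import secrets
import string
from typing import Dict, List, Tuple

ALPHABET = string.ascii_letters + string.digits


def random_name(length: int = 10) -> str:
    """Random identifier; the first character is a letter so it stays valid."""
    first = secrets.choice(string.ascii_letters)
    rest = "".join(secrets.choice(ALPHABET) for _ in range(length - 1))
    return first + rest


def recode(web_code: str, names: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each listed name with a fresh random name, using the same
    replacement for every occurrence so internal references keep working."""
    mapping = {name: random_name() for name in names}
    # Longest names first so a short name never pre-empts a longer one.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b")
    recoded = pattern.sub(lambda m: mapping[m.group(1)], web_code)
    return recoded, mapping
```

Because a new mapping is drawn on every call, two servings of the same page would carry different names, while references within a single serving remain mutually consistent.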

Instrumentation code 110B is added to the code 110A by the security server system 106, and allows the system 100 to detect malware in addition to deflecting its efforts. In particular, the instrumentation code 110B can execute in the background on the computing devices 102A-C and can monitor how the code 110A operates and how other code on the particular computing device 102A-C interacts with the execution of code 110A. For example, the instrumentation code 110B can monitor the DOM made from the code 110A at different points in time and may report back to security server system 106 information that characterizes the current state of the DOM. Such information can be compared to information that indicates what the DOM should look like in order to determine whether other code is interfering with the execution of code 110A. Alternatively, or in addition, the instrumentation code can identify anomalous attempts by third-party code to interact with the operation of code 110A, such as for calls made to code 110A using “old” names for the code (e.g., names that were valid in a prior serving of the relevant web page but that are no longer relevant because security server system 106 is constantly changing the names so as to create a moving target for such third-party code to hit).

A user telemetry script 110C is also provided to a requesting one of computing devices 102A-C. The user telemetry script 110C may include code for managing communications between the relevant client device and the security server system 106. Such communications may include transmission of information identified by the instrumentation code 110B described above, and other relevant information. In certain implementations, the security server system 106 can be supplied additional information using the user telemetry script and after the code 110A has been served, such as information that affects the manner in which the instrumentation code 110B operates. For example, the security server system 106 may receive a report from the user telemetry script 110C that indicates that a third-party program is attempting to interact with the served code 110A, and may respond so as to have the instrumentation code 110B perform certain operations to better understand the nature of the interaction occurring on the computing device.

A request frequency code 110D may also be sent and may be as simple as a single number that biases the user telemetry script 110C to return information to the security server system 106 in its original form, or instead in a compressed form. For example, the request frequency code 110D that is sent in this example is a value of 1000, which may have been selected by the security server system 106 from a range between 0 and 1024 in this example. In turn, the user telemetry script 110C may be programmed to select a random number between 0 and 1024, and to return the original text rather than a compressed version of the original text when the randomly-selected number exceeds 1000. As a result, original text will be returned by only about 2% of all computing devices that are served code from the security server system 106 using this request frequency value. Others of the computing devices will return a compressed version of the text, such as a hash of the original text produced by the particular device.
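The selection just described can be sketched as follows, under the assumption that the random number is an integer drawn uniformly from 0 to 1024 inclusive (the text does not say whether the draw is an integer or whether the endpoints are included). The function names are hypothetical.

```python
import random

RANGE_MAX = 1024  # upper end of the range used in the example above


def should_send_plaintext(request_frequency: int, rng: random.Random) -> bool:
    """Return True when the device should send the original text.

    Assumes an integer draw from 0..RANGE_MAX inclusive; plaintext is sent
    only when the draw exceeds the request frequency value.
    """
    draw = rng.randint(0, RANGE_MAX)  # randint includes both endpoints
    return draw > request_frequency


def plaintext_fraction(request_frequency: int) -> float:
    """Expected share of devices sending plaintext under the same assumption."""
    # RANGE_MAX + 1 equally likely outcomes; RANGE_MAX - f of them exceed f.
    return (RANGE_MAX - request_frequency) / (RANGE_MAX + 1)
```

Under this assumption a request frequency value of 1000 leaves 24 of 1025 possible draws above the threshold, which is consistent with the figure of about 2% given above. Lowering the value makes plaintext reports more frequent, at the cost of more bandwidth.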

Upon receiving the code 110, the particular client devices 102A-C may render respective webpages and establish document object models that represent the served page, in a familiar manner. User interactions with the webpage and associated code may then begin. At or around that time, the instrumentation code 110B and user telemetry script 110C may execute to return information about the configuration of a particular computing device to the security server system 106. For example, the user telemetry script 110C may return data that identifies the operating system of the particular computing device, the model of the particular computing device, the amount of RAM loaded on the computing device, other applications executing on the computing device, and similar information. In certain implementations, such functionality may be provided using a browser plug-in that is programmed to perform a check of the environment for the machine on which it is running. Generally, JavaScript or VBScript can permit the measurement of User Agent, other HTTP header information, indirect measurements of the JavaScript execution environment, plug-in information, fonts, and screen information.

As shown by the arrow labeled with a 1 in a circle, computing device 102A returns the numeric pair 24.16. These numbers represent, respectively, a hash of a textual string that represents the name and model of the browser that is running on computing device 102A, and the number of characters in that string. In the example here, all three computing devices 102A-C are running the “Chrome 2.3.21.04” browser release, as an example. Such information may be obtained by making a request that is to be responded to with the “user agent” string on the particular computing device, in a familiar manner. In the current example, computing device 102A delivered this compressed representation of the user agent string, because it generated a random number of 300, which is less than the request frequency number of 1000.

Similarly, when computing device 102B received the served code 110, it generated a random number of 674, meaning that it too would send a compressed version of the user agent string, namely 24.16. In both these examples, 24 has been selected as an example to represent a hash that may be created from such a string, and the number 16 represents the number of characters in that string.

The actual string itself can be seen as being transmitted from computing device 102C back to security server system 106. Here, computing device 102C selected a random number of 1012, which is greater than the request frequency number of 1000. As a result, computing device 102C will be one of the 2% of all devices that report back the original, uncompressed (unhashed) version of the user agent string.

To better show the level to which an initial string can be compressed, the user agent string for Firefox on an iPad is “Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405”. A compressed representation that indicates a hash and a length might be of the form 4528.111. As can be appreciated, the bandwidth for the latter is much lower than for the former.
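A compressed representation of the hash-and-length form can be sketched as follows. The sketch assumes a Java-style 32-bit string hash and ASCII input; both the function names and the choice to print the hash as an unsigned number are illustrative assumptions, and the hash value it produces for any given string will generally differ from the example values 24 and 4528 used in this description.

```python
def java_string_hash(text):
    """Java-style String.hashCode: h = 31*h + char, as a signed 32-bit value.

    Assumes characters in the Basic Multilingual Plane, where a Python
    code point equals the single UTF-16 unit Java would see.
    """
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep only the low 32 bits
    return h - 0x100000000 if h >= 0x80000000 else h


def compress(text):
    """Return "<hash>.<length>", with the hash shown as an unsigned number."""
    unsigned_hash = java_string_hash(text) & 0xFFFFFFFF
    return f"{unsigned_hash}.{len(text)}"


USER_AGENT = (
    "Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) "
    "AppleWebKit/531.21.10 (KHTML, like Gecko) Mobile/7B405"
)
```

Under these assumptions, `compress(USER_AGENT)` yields at most a ten-digit hash, a period, and the three-digit length 111, compared with the 111 characters of the string itself.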

In the figure, operations of the security server system 106 performed in response to receiving the communications from devices 102A-C are shown schematically as a two-column database entry below security server system 106 and Web server system 108. The two columns are shown to indicate how a system may associate a compressed version of a string with the actual string itself. In a first representation shown by a 1 in a circle and corresponding to actions that would occur in response to the first transmission from computing device 102A, the database has been populated with the hash value of 24 upon receiving that hash value from device 102A. The system 100 does not, at that point, know what the original string representation for that value is (assume that the system did not receive earlier communications regarding the user agent string from other devices), but stores the hash value 24 in anticipation that it will eventually be able to determine what the original, plaintext value is.

In a second representation shown by the number 2 in a circle and representing the transmission from computing device 102B, the table has not changed because again, the security server system 106 received only the hash code, and not the original version of the user agent string. Finally, at the bottom of the representation, the system receives a string of original plaintext and, as shown by the arrow labeled with “hash,” the system performs a hashing function on that plaintext that is the same as the hashing function that the system 100 knows to be used by the computing devices 102A-C. For example, each of the computing devices 102A-C and the security server system 106 may be programmed to use the same hashing algorithm, such as the Java hash algorithm, which is well known and readily available on many computing platforms.

With that hash value (24) in hand, the security server system 106 may search the table for a matching value, and when it finds such a matching value, it may determine that the matching hash value is what corresponds to the original text. It may then update the table to correlate the particular hash value with the particular original plaintext. Such a correlation is shown in the row of the table labeled with a 3 in a circle.
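The three table states described above can be sketched with a small in-memory structure. The class name, its methods, and the use of a dictionary in place of a database table are hypothetical; the hash function is passed in so that the server applies whatever function the devices are known to use.

```python
class CorrelationTable:
    """Maps hash values to plaintext; None means "not yet known"."""

    def __init__(self, hash_fn):
        self.hash_fn = hash_fn  # must match the hash used by the devices
        self.rows = {}

    def receive_hash(self, hash_value):
        # States 1 and 2: remember the hash even though its meaning is unknown.
        self.rows.setdefault(hash_value, None)

    def receive_plaintext(self, text):
        # State 3: hash the plaintext locally and fill in the matching row.
        hash_value = self.hash_fn(text)
        self.rows[hash_value] = text
        return hash_value

    def lookup(self, hash_value):
        return self.rows.get(hash_value)
```

In use, two calls to `receive_hash(24)` leave a single row with no plaintext, and a later `receive_plaintext` call whose text hashes to 24 completes that row, so earlier and later reports of 24 can both be interpreted.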

This correlation may then be used with other parts of the system. For example, the number 24 can be used throughout the system to represent the user agent string represented here (i.e., as a unique database index value). As some examples, a cluster analysis system like that discussed with respect to FIG. 2 below may use the number 24 to represent such a feature instead of using the full string representation. In other embodiments, yet a third representation for the feature may be used as an index representation.

The processing of the communication from computing device 102C may also be accompanied by a determination that the full string is 16 characters in length. Such a value may be stored in yet a third column of the table (not shown) and may be correlated to the hash value and the original plaintext of the string. When later communications arrive with a hash value of 24, they may be compared to the first column shown in the table, and their accompanying value of 16 may be compared to this additional value to provide more confidence that the hash value is unique to this particular original textual string. As discussed above, other techniques may also be used to ensure that there are no collisions in the hash values, such as by returning an additional number or other representation that is generated by an alternative hash algorithm. In certain implementations, if the security server system 106 identifies that there may be a problem with a received hash value, the security server system 106 may provide a special message to the responding computing device to trigger the responding computing device to transmit the original plaintext code instead of the hash value.
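The length comparison can be sketched as a simple check on an incoming report of the form hash.length. The function names, the dictionary layout, and the three status strings are assumptions for illustration; the "request_plaintext" result stands in for the special message described above.

```python
def parse_report(report):
    """Split a "hash.length" report such as "24.16" into two integers."""
    hash_part, length_part = report.split(".")
    return int(hash_part), int(length_part)


def check_report(report, known):
    """Compare a report against known rows of hash -> (plaintext, length)."""
    hash_value, length = parse_report(report)
    if hash_value not in known:
        return "unknown"  # no plaintext has been correlated yet
    if known[hash_value][1] != length:
        # Same hash but different length: likely a collision, so the
        # server would ask the device for the original plaintext.
        return "request_plaintext"
    return "ok"
```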

Other tables may store additional relationships that are of value in operating the system 100. For example, one table may store identifiers for particular ones of the computing devices 102A-C, where a particular device may be identified by a cookie that it stores and passes to the security server system 106. That device identifier may then be related to the variety of parameters, such as the user agent parameter just discussed, and additional parameters, which may include hardware identifiers, operating system identifiers, and software identifiers, among other things. By this mechanism then, the system 100 may correlate a particular device to particular configuration information and to configuration parameters reported by the device.

This particular example is highly simplified for purposes of clarity. In a typical implementation, many different webpages and other Web resources will be served by system 100 to many different computing devices. Thus, a large number of different hash values will be received by security server system 106 in an interleaved fashion with each other, and the system will need to correlate those hash values or other compressed values with particular original text represented by those values. Such multi-value implementation may occur, for example, by adding additional records to the simple table shown here, or by other appropriate techniques.

A server system can also specify a seed to be used before generating a random number, or specify another random number generation method (and the initial state of the pseudorandom number generator (PRNG)), and the choice threshold value, such that the sequence of fields chosen will be known by the server. This can be used to force the client to generate an “uncompressed” value for a field that is unknown by the server. It can also be used to allow the server to have more control over the data flow (more or less data), and can even be used as a mechanism for determining when a malicious client is sending data in a non-compliant format, which could be used to determine that the client is, in fact, infected with malware.
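A seeded generator makes the sequence of plaintext-versus-compressed choices reproducible on the server, as in the following sketch. It assumes that the client and the server run the same PRNG algorithm from the same seed (here both use Python's `random.Random`, whereas a browser client would need an equivalent generator implemented in its own language); the function names are hypothetical.

```python
import random


def choices_for_seed(seed, threshold, count, range_max=1024):
    """Return the expected sequence of choices: True means send plaintext."""
    rng = random.Random(seed)
    return [rng.randint(0, range_max) > threshold for _ in range(count)]


def is_compliant(seed, threshold, observed):
    """Check a client's observed choices against the sequence the server expects."""
    return observed == choices_for_seed(seed, threshold, len(observed))
```

A threshold below the bottom of the range (for example, -1) makes every draw exceed it, which is one way a server could force plaintext for every field, and a client whose observed choices diverge from the expected sequence can be flagged for closer inspection.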

From time to time, hash values that have already been correlated with original text may also be tested by other incoming original text. For example, the security server system 106 might not normally perform a hash on incoming original text if the system 106 has determined that there is already a correlation for that text in the table.

However, a random number approach similar to that used on the computing devices 102A-C may be used so that the security server system 106 periodically does perform such a hashing and comparison so as to confirm the accuracy of the data in the table. If the system 106 determines that there is an inaccuracy, because the hash value generated for an incoming string of text does not match a pre-existing hash value in the system for that text, the system 106 may generate an exception and alert an operator of the system 106.
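The occasional re-check can be sketched as follows. The function name, the plaintext-keyed dictionary, the `audit_rate` parameter, and the use of `ValueError` as the exception are all illustrative assumptions.

```python
import random


def handle_plaintext(text, table, hash_fn, audit_rate, rng=None):
    """Process incoming plaintext against a table of plaintext -> stored hash.

    Known text is normally skipped; with probability audit_rate it is
    re-hashed and compared with the stored value instead.
    """
    rng = rng or random.Random()
    if text not in table:
        table[text] = hash_fn(text)
        return "added"
    if rng.random() >= audit_rate:
        return "skipped"  # the usual case: no hashing for known text
    if hash_fn(text) != table[text]:
        # Stored data is inconsistent; surface it so an operator is alerted.
        raise ValueError(f"stored hash for {text!r} does not match")
    return "verified"
```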

Also, the example here is stated in terms of Web code being served to a general web browser. Other types of code may alternatively be served to other types of applications. In such situations, those other applications may be caused to choose whether to return original or compressed configuration information, or such decisions may be made by code separate from the applications but made in respect to the execution of the applications.

Also, although the techniques discussed here have been associated with communications for the delivery of information related to browser environment and user or automated interactions with web pages, they may also, in appropriate circumstances, be applied more generally. For example, other data that is reported at periodic intervals and is common as between a substantial portion of those reporting events, may be compressed using the techniques here, and interpreted using uncompressed (or losslessly compressed) messages in some instances of the reporting, and lossy compressed messages corresponding to the same content in other instances of reporting. Various mechanisms, including those discussed above and below, may be used to identify that the compressed and uncompressed messages match each other in their content, and to then associate the compressed messages with the uncompressed content.

In addition, while the techniques are described here as involving transmission of data to a server system from web code served to a browser, other techniques may also be used. For example, a stand-alone application for a particular organization may report information to a server system, and may be programmed to use the sometimes-compressed/sometimes-uncompressed techniques described here to transmit necessary data to the server system (particularly when the data is largely repetitive as between different reporting events for the data).

FIG. 2 is a schematic diagram of a system 200 for performing deflection and detection of malicious activity with respect to a web server system. The system 200 may be the same as the system 100 discussed with respect to FIG. 1, and is shown in this example to better explain the interrelationship of various general features of the overall system 200, including the use of the reporting of compressed and uncompressed versions of the same strings in order to conserve bandwidth (for compressed representations) and to determine what the compressed representations represent (for uncompressed representations).

The system 200 in this example is a system that is operated by or for a large number of different businesses that serve web pages and other content over the internet, such as banks and retailers that have on-line presences (e.g., on-line stores, or on-line account management tools). The main server systems operated by those organizations or their agents are designated as web servers 204 a-204 n, and could include a broad array of web servers, content servers, database servers, financial servers, load balancers, and other necessary components (either as physical or virtual servers).

A set of security server systems 202 a to 202 n are shown connected between the web servers 204 a to 204 n and a network 210 such as the internet. Although both extend to n in number, the actual number of sub-systems could vary. For example, certain of the customers could install two separate security server systems to serve all of their web server systems (which could be one or more), such as for redundancy purposes. The particular security server systems 202 a-202 n may be matched to particular ones of the web server systems 204 a-204 n, or they may be at separate sites, and all of the web servers for various different customers may be provided with services by a single common set of security servers 202 a-202 n (e.g., when all of the server systems are at a single co-location facility so that bandwidth issues are minimized).

Each of the security server systems 202 a-202 n may be arranged and programmed to carry out operations like those discussed above and below and other operations. For example, a policy engine 220 in each such security server system may evaluate HTTP requests from client computers (e.g., desktop, laptop, tablet, and smartphone computers) based on header and network information, and can set and store session information related to a relevant policy. The policy engine 220 may be programmed to classify requests and correlate them to particular actions to be taken on code returned by the web server systems (for transmission to requesting clients) before such code is served back to a client computer. When such code returns, the policy information may be provided to a decode, analysis, and re-encode module 224, which matches the content to be delivered, across multiple content types (e.g., HTML, JavaScript, and CSS), to actions to be taken on the content (e.g., using XPATH within a DOM), such as substitutions, addition of content, and other actions that may be provided as extensions to the system. For example, the different types of content may be analyzed to determine naming that may extend across such different pieces of content (e.g., the name of a function or parameter), and such names may be changed in a way that differs each time the content is served, e.g., by replacing a named item with randomly-generated characters. Elements within the different types of content may also first be grouped as having a common effect on the operation of the code (e.g., if one element makes a call to another), and then may be re-encoded together in a common manner so that their interoperation with each other will be consistent even after the re-encoding.

A rules engine 222 may store analytical rules for performing such analysis and for re-encoding of the content. The rules engine 222 may be populated with rules developed through operator observation of particular content types, such as by operators of a system studying typical web pages that call JavaScript content and recognizing that a particular method is frequently used in a particular manner. Such observation may result in the rules engine 222 being programmed to identify the method and calls to the method so that they can all be grouped and re-encoded in a consistent and coordinated manner.

The decode, analysis, and re-encode module 224 encodes content being passed to client computers from a web server according to relevant policies and rules. The module 224 also reverse encodes requests from the client computers to the relevant web server or servers. For example, a web page may be served with a particular parameter, and may refer to JavaScript that references that same parameter. The decode, analysis, and re-encode module 224 may replace the name of that parameter, in each of the different types of content, with a randomly generated name, and each time the web page is served (or at least in varying sessions), the generated name may be different. When the name of the parameter is passed back to the web server, it may be re-encoded back to its original name so that this portion of the security process may occur seamlessly for the web server.
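The consistent renaming and its reversal can be sketched in a highly simplified form. The function names, the alias format, and the whole-word text substitution are assumptions for illustration; an actual module would operate on parsed content rather than on raw text.

```python
import re
import secrets


def encode_names(content_by_type, names):
    """Replace each name with one fresh random alias across all content types."""
    # A leading letter keeps each alias a valid identifier.
    mapping = {name: "p" + secrets.token_hex(8) for name in names}
    encoded = {}
    for kind, text in content_by_type.items():
        for original, alias in mapping.items():
            text = re.sub(rf"\b{re.escape(original)}\b", alias, text)
        encoded[kind] = text
    return encoded, mapping


def decode_request(params, mapping):
    """Restore original parameter names before the request reaches the web server."""
    reverse = {alias: original for original, alias in mapping.items()}
    return {reverse.get(key, key): value for key, value in params.items()}
```

Because the same mapping is applied to every content type in one serving, HTML and JavaScript that refer to the same parameter continue to agree, while a new mapping is generated on each call.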

A key for the function that encodes and decodes such strings can be maintained by the security server system 202 along with an identifier for the particular client computer so that the system 202 may know which key or function to apply, and may otherwise maintain a state for the client computer and its session. A stateless approach may also be employed, whereby the system 202 encrypts the state and stores it in a cookie that is saved at the relevant client computer. The client computer may then pass that cookie data back when it passes the information that needs to be decoded back to its original status. With the cookie data, the system 202 may use a private key to decrypt the state information and use that state information in real-time to decode the information from the client computer. Such a stateless implementation may create benefits such as less management overhead for the server system 202 (e.g., for tracking state, for storing state, and for performing clean-up of stored state information as sessions time out or otherwise end) and as a result, higher overall throughput.

An instrumentation module 226 is programmed to add instrumentation code to the content that is served from a web server. The instrumentation code is code that is programmed to monitor the operation of other code that is served. For example, the instrumentation code may be programmed to identify when certain methods are called, where those methods have been identified as likely to be called by malicious software. When such actions are observed to occur by the instrumentation code, the instrumentation code may be programmed to send a communication to the security server reporting on the type of action that occurred and other metadata that is helpful in characterizing the activity. Such information can be used to help determine whether the action was malicious or benign.

The instrumentation code may also analyze the DOM on a client computer in predetermined manners that are likely to identify the presence of and operation of malicious software, and to report to the security servers 202 or a related system. For example, the instrumentation code may be programmed to characterize a portion of the DOM when a user takes a particular action, such as clicking on a particular on-page button, so as to identify a change in the DOM before and after the click (where the click is expected to cause a particular change to the DOM if there is benign code operating with respect to the click, as opposed to malicious code operating with respect to the click).

Data that characterizes the DOM may also be hashed, either at the client computer or the server system 202, to produce a representation of the DOM (e.g., in the differences between part of the DOM before and after a defined action occurs) that is easy to compare against corresponding representations of DOMs from other client computers.

Other techniques may also be used by the instrumentation code to generate a compact representation of the DOM or other structure expected to be affected by malicious code in an identifiable manner.

The instrumentation module 226 or another component may also provide a user telemetry script or other code for causing the client device receiving the other code to communicate with the server system after the code is transmitted. Such additional code may include code that causes the client devices to return configuration information about themselves, and to control whether they return the information in a compressed or native state, in the manners described above. The module 226 may also generate and provide to the client devices a request frequency value that helps control how often the native text is transmitted back to the system instead of the compressed form of the text. One or more modules may also control the receipt of such configuration information, the storage of the information, and the correlation of the compressed data (e.g., being used as an index value for a table) and the corresponding original form of the data.

As noted, the content from web servers 204 a-204 n, as encoded by decode, analysis, and re-encode module 224, may be rendered on web browsers of various client computers. Uninfected client computers 212 a-212 n represent computers that do not have malicious code programmed to interfere with a particular site a user visits or to otherwise perform malicious activity. Infected client computers 214 a-214 n represent computers that do have malware, or malicious code (218 a-218 n, respectively), programmed to interfere with a particular site a user visits or to otherwise perform malicious activity. In certain implementations, the client computers 212, 214 may also store the encrypted cookies discussed above and pass such cookies back through the network 210. The client computers 212, 214 will, once they obtain the served content, implement DOMs for managing the displayed web pages, and instrumentation code may monitor the respective DOMs as discussed above. Illogical activity (e.g., software on the client device calling a method that does not exist in the downloaded and rendered content) can then be reported back to the server system.

The reports from the instrumentation code may be analyzed and processed in various manners in order to determine how to respond to particular abnormal events, and to track down malicious code via analysis of multiple different similar interactions across different client computers 212, 214. For small-scale analysis, each web site operator may be provided with a single security console 207 that provides analytical tools for a single site or group of sites. For example, the console 207 may include software for showing groups of abnormal activities, or reports that indicate the type of code served by the web site that generates the most abnormal activity. For example, a security officer for a bank may determine that defensive actions are needed if most of the reported abnormal activity for its web site relates to content elements corresponding to money transfer operations, an indication that stale malicious code may be trying to access such elements surreptitiously.

A central security console 208 may connect to a large number of web content providers, and may be run, for example, by an organization that provides the software for operating the security server systems 202 a-202 n. Such console 208 may access complex analytical and data analysis tools, such as tools that identify clustering of abnormal activities across thousands of client computers and sessions, so that an operator of the console 208 can focus on those clusters in order to diagnose them as malicious or benign, and then take steps to thwart any malicious activity.

In certain other implementations, the console 208 may have access to software for analyzing telemetry data received from a very large number of client computers that execute instrumentation code provided by the system 200. Such data may result from forms being re-written across a large number of web pages and web sites to include content that collects system information such as browser version, installed plug-ins, screen resolution, window size and position, operating system, network information, and the like. In addition, user interaction with served content may be characterized by such code, such as the speed with which a user interacts with a page, the path of a pointer over the page, and the like. The telemetry data may also include the received data that characterizes the then-current conditions of each of the client devices, such as the browser and operating systems that they were running, and other appropriate information.

Such collected telemetry data, across many thousands of sessions and client devices, may be used by the console 208 to identify what is “natural” interaction with a particular page that is likely the result of legitimate human actions, and what is “unnatural” interaction that is likely the result of a bot interacting with the content.

Statistical and machine learning methods may be used to identify patterns in such telemetry data, and to resolve bot candidates to particular client computers. Such client computers may then be handled in special manners by the system 200, may be blocked from interaction, or may have their operators notified that their computer is potentially running malicious software (e.g., by sending an e-mail to an account holder of a computer so that the malicious software cannot intercept it easily).

FIG. 3 is a flow chart of a process for reducing bandwidth requirements between computers. In general, the process involves providing client computers with code that causes the computers to report back aspects of their operation. Different ones of the client computers are caused to report the information in compressed form, while others of the client devices are caused to report the same information in an original uncompressed, or plaintext, form. The process can then use the combination of compressed and uncompressed reported information to correlate the compressed representations with the uncompressed representations, even though no particular computer or transmission provided such a correlation for the server system that served the code. The server system may make the correlation, for example, by performing a compression of received uncompressed code in a manner that matches the way that one or more of the client devices performed the compression of the same code or data.

The process begins at box 302, where the server system serves Web code to a plurality of different client devices. The Web code may be code for a particular webpage, for multiple related webpages, or for various unrelated webpages associated with different websites, including websites from different domains. In certain implementations, the Web code may be recoded from what is initially served by a Web server, such as by rewriting the names of particular functions or other elements in unpredictable manners but in a way that is consistent across all of the elements being served (e.g., so that the code does not break when executed and so that calls made to a particular function or other element are changed according to the changes made in the name of the element).

At box 304, supplemental code is served by the system. The supplemental code may be served along with the Web code in a single transaction, or may be served separately. The supplemental code may include, for example, instrumentation code and telemetry code that causes the receiving client device to monitor the operation of the Web code that is served to the device and potentially to report back on such operation to a security server system, if the monitoring determines that anomalous activity is occurring on the client device. Other code may also be served, such as parameter values that may affect the way in which the supplemental code operates, such as a request frequency number described above, and other appropriate values.

At box 306, the server system may have waited after serving both the Web code and the supplemental code, and may subsequently receive, from the client or clients to whom the code was served, hashed representations of configuration information. Those representations may represent a variety of parameters that are relevant to the client devices from which they come, including identifiers for the current configuration state of a particular client device. The particular parameter may be identified, and the value of the identified parameter may be indicated by the hash code that one of the client devices generated by hashing the plaintext parameter value. A number of different parameters may be reported on for each client device, and even more parameters may be reported on across a universe of client devices. For example, Web code served from a certain webpage may be accompanied by instrumentation code that reports back on certain parameters of a device, while Web code served for another webpage may be accompanied by code that reports back on other parameters.

When the system receives such hashed representations, it may save them, as shown at box 308, even though it does not at that time know what original values they represent. Such representations may also be associated with identifiers for client devices from which they were received, so that the particular configuration information for those devices may be determined later, even if it cannot be determined when the hashed representations are initially received.

At box 310, plaintext representations are received from one or more client devices. The plaintext representations may have been transmitted by those client devices in response to the client devices executing instrumentation or telemetry code that instructed the transmission of such plaintext versions of the information (e.g., upon the client device choosing to transmit plaintext rather than a compressed representation). When the security system receives plaintext representations from telemetry code, it may be programmed to first compress those plaintext representations such as by hashing them. The compression may occur according to a mechanism that matches a hashing mechanism known to be operating on the client devices in cooperation with the instrumentation and telemetry code that was served to those client devices.

With the plaintext representations having been hashed, the security system will now have a correlation between a particular plaintext representation and a particular hash value. The system may then compare that hash value to any of the hashed values that have previously been received, at box 314, and may then correlate whatever previously-received hash values were received to the plaintext representation that was later received, at box 316. In certain examples, the initial transfer of a particular piece of data may be in plaintext form, so that the database would be populated with a plaintext representation and a hash representation simultaneously. Later transmissions of plaintext representations may simply be matched against the plaintext column of the database, and the devices that sent those plaintext representations may be correlated with the hashed value as an index value for those devices. Alternatively, the plaintext values that are later received may always be hashed, and the hashed values may be compared against the database if that is a more efficient operation of the system computationally. Also, periodically, plaintext representations and their hash values may be checked against the table to ensure that there are no errors in the data. In addition, other values that represent the plaintext may be transmitted along with the hashed representations of the plaintext so as to ensure that the system is not receiving overlapping hash values that match each other but that each represent different plaintext representations.

At box 318, characteristics of infected computers are identified using information gleaned from the previous steps. For example, the hashed values may be used as data in statistical analysis techniques, such as techniques that may attempt to identify clusters of activity within a population of computers, such as a population of hundreds of thousands of computers. Clustering may indicate anomalous activities by those computers, and the hash values may then be used to determine what configuration information is possessed in common by computers within that cluster. As one example, the analysis may determine that a large majority of computers having anomalous behavior are running a recently released operating system or browser version (i.e., that anomalous behavior is clustered around a dimension associated with that particular value of the user agent parameter for a population of machines). Such a determination may be evidence of a vulnerability of such browser or operating system version to malware. An operator of the system described here may then act upon such information, such as to cause the browser or operating system to be updated or the security hole to otherwise be plugged.
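As a very rough stand-in for the statistical analysis mentioned above, the following sketch compares how often each hash value appears among devices flagged as anomalous with how often it appears in the whole population. It is a simple over-representation ratio rather than a clustering algorithm, and the function name, input layout, and `min_lift` threshold are hypothetical.

```python
from collections import Counter


def overrepresented_values(devices, min_lift=2.0):
    """devices: iterable of (hash_value, is_anomalous) pairs.

    Returns {hash_value: lift} for values whose share among anomalous
    devices is at least min_lift times their share of all devices.
    """
    devices = list(devices)
    total = Counter(hash_value for hash_value, _ in devices)
    anomalous = Counter(hash_value for hash_value, flag in devices if flag)
    n_all = len(devices)
    n_anomalous = sum(anomalous.values())
    if n_anomalous == 0:
        return {}
    result = {}
    for hash_value, count in anomalous.items():
        lift = (count / n_anomalous) / (total[hash_value] / n_all)
        if lift >= min_lift:
            result[hash_value] = lift
    return result
```

A hash value flagged in this way can then be looked up in the correlation table to recover the plaintext configuration value (for example, a particular user agent string) that the anomalous devices have in common.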

FIG. 4 is a swim lane diagram of a process for transferring data between client computers and a server system. In general, the process, like those discussed above, involves transmitting content to a server system, in most instances, in a compressed manner from which the identity of the original content cannot be determined (a lossy compression, like forming a hash). In a small number of cases, the content can be transmitted in an uncompressed or losslessly compressed form. The received data may then be compressed using a process equivalent to the process that was used by clients on the other received content, and the compressed form may be matched to the compressed forms received in that other received content. In this way, the original form of the other received content (both past and future) can be inferred.

The process begins at box 402, where a client device requests a web page, such as via a GET or POST method. Such a request may be directed to a particular URL served by a web server system of a particular organization. The request may result in the web server system identifying appropriate code to respond to the request, which may include static code and dynamic code, and may take the form of HTML, CSS, and JavaScript, among others. At box 404, the web server system serves the responsive code.

The served code is intercepted at box 406 by a security server system that, e.g., the operator of the web server system has added as an intermediary for providing security for the web server system. For example, a third party may provide a security system that can be added modularly to a company's web server system without having to affect the web server system in any substantial manner. In other implementations, the intermediary functionality may be integrated in the web server system. Also, the intermediary server system may be physically located within the same building as the web server system (for minimizing latency and maximizing the ability to coordinate systems) or in a separate location that requires communication through a network, including the Internet.

At box 406, the security server system intercepts the code and modifies it. For example, as described above, the names of certain functions may be changed in a sufficiently random or arbitrary manner that the new names cannot be anticipated by malware running on the clients. The changes may be coordinated across different types of code (e.g., HTML, CSS, and JavaScript) where the names occur, so that the code functions the same as the code it replaced. Generally, the changes are made to latent code whose operation a user does not see, and to static code.

At box 408, the code is appended with monitoring and reporting code. Such code may monitor the DOM that is created on the client when the served code is rendered, or may monitor attempts to interact with the code, and may characterize and report any abnormal activity. Such code may also report other status information about a client, such as configuration information that describes the features of the client system. In certain situations, a complete picture of what is occurring in the browser or other application (e.g., a specific app programmed for the company that serves the code) may be reported. The reporting code may in particular include code for making a determination whether to report particular information in a compressed versus an uncompressed form, and then to transmit the data back to the server system accordingly.

At box 410, the client renders the web page by executing the various types of served code, and perhaps by acquiring code from other sources in addition to the code that was initially served by the web server system (whether from the organization that operates the web server system or from one or more other organizations). As described throughout this process, the serving and executing of code described here would be repeated across thousands or more different client devices that may each vary in different ways, such as by having different base hardware (the basic computer) and extended hardware (e.g., added graphics cards or RAM), operating systems, installed and executing applications, and executing browser plug-ins. Thus, each rendering of the web page may be performed in a different manner for different ones of the client devices, and even for the same client device in different sessions.

At box 412, the client device generates characterization and activity data that is to be sent back to the server systems. The box is labeled with a “1” to indicate that this step represents a subset of the devices that are served the web code, namely the devices that hash the data that is to be reported so as to lower the bandwidth required for such reporting. Generally, the vast majority of instances would be configured to report in such a manner so as to significantly reduce the overhead of transmitting the data.

In this example, characterization data represents status of the client device, such as hardware and software on the device, whereas activity data represents actions that have occurred on the device, particularly since the device received the served web code (e.g., activities between the served code and other code that is on the device). The characterization data may be sent to one server system, while the activity data may be sent to another, or they may be sent to the same server system. Also, certain data may be sent according to the compressed/uncompressed scheme described in this document when the data is expected to be common across many devices, so that the original value of the content for devices that compress their content can be inferred from the uncompressed content (where, unless otherwise noted, uncompressed content includes content whose original form can be determined by a server system that receives it, and thus includes losslessly compressed content). Other data may be sent in a normal manner, without the pairing of compressed/uncompressed transmission, such as where the content is not typically common as among different machines, so that there would be relatively little value in trying to infer the original content from transmissions made by other machines.

At box 414, an analysis system receives the reported data, which may include activity data. The analysis system may use such activity data to identify that certain normal or anomalous activities have occurred on a certain device, and may conduct analysis on similar activity data received from a large number of other devices to identify clusters of common activity so as to determine that malware is taking advantage of such devices. The analysis system may also be provided with characterization data so that it can determine characteristics of the devices that are being affected by the malware.

Separately, or as part of the same communication, the client device may provide similar data to the security server system, as indicated at box 416. The security server system may then associate the particular client device with the hashed forms of the content that is sent (as the analysis system may do if it receives only hashed data). At this point in the example process, the security server system has received no unhashed form of the content, so it does not know what the original form of the content was. As a result, the system may simply associate an identifier for the particular device with the hashed form of the received content (or may simply increment a count of the number of clients reporting the content of the particular form of hash). In this example, multiple different fields may be reported in a hashed manner, such as one or more fields that identify hardware for a device, and one or more fields that identify software executing on the device. Each feature of the device (e.g., make and model, operating system, amount of RAM, etc.) may receive its own hash, or groups of features may receive a single hash, where each hash is selected so as to cover content that is likely to be common across many devices, so that the hash value may be readily reverse-engineered when an uncompressed version of the content is received from another device.
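By way of a non-limiting illustration, hashing each feature separately versus hashing a group of features together might be sketched as follows. The function names, the field names, the separator character, and the use of SHA-256 are assumptions made for this sketch and are not specified in this document.

```python
import hashlib


def hash_text(text: str) -> str:
    # SHA-256 is an assumed hash function; the document does not name one.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def hash_fields(report: dict[str, str]) -> dict[str, str]:
    """Give each feature of the device its own hash."""
    return {name: hash_text(value) for name, value in report.items()}


def hash_group(report: dict[str, str], names: list[str]) -> str:
    """Give a group of features a single hash.

    Field names are sorted so that every device serializes the group the same
    way; the unit separator "\x1f" is an arbitrary delimiter for this sketch.
    """
    joined = "\x1f".join(f"{name}={report[name]}" for name in sorted(names))
    return hash_text(joined)
```

Per-feature hashes match across any two devices that share that one feature, whereas a group hash matches only when every feature in the group is identical, so grouping is better suited to features that tend to occur together across many devices.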

At box 418, another client (indicated by the circled “2”) also generates and reports characterization and activity data. In this instance, the particular device does not compress the content that it reports (e.g., because it selected a number pseudo-randomly that does not exceed a predetermined level that was provided with the web page code). The analysis system may receive at least some of the generated content at box 420 (which may be the same content as received at box 414 or may contain some fields whose parameters are the same as those received at box 414), though here the content would be received in uncompressed form (e.g., either as plaintext or in a losslessly compressed format). To the extent the analysis system previously received parameters for certain fields in compressed format, it may compress the received uncompressed content to form a hash value and may then compare it to compressed content that was previously received. If the hash value matches a hash value stored from box 414, then the original content may be associated by the system with the other devices that previously reported the hashed value, as may future devices that report the hashed value. Alternatively, or in addition, the analysis system can increment a count of devices that have reported having the particular parameter.

Similarly, the second client device can report the characterization and activity data to the security server system, and at box 422, that system can generate a hash value for it. As with the analysis system, certain other fields may have been reported in hashed form or may always be reported by all devices in uncompressed form.

At box 424, the security server system associates the particular parameters received from box 412 with the other instances of reporting the same content (as determined by comparing the just-generated hash value with previously-received hash values from the other devices).

In situations where the analysis system does not separately track associations between particular device IDs and content reported by those devices, the analysis system can request ID and parameter data (box 426) from the security server system. The security server system (box 428) may gather and transmit such data, and the analysis system may identify common features of anomalously-acting machines (box 430) using such data. In other words, in one implementation, the analysis system may receive activity data and use such data to identify clusters of common activity, or otherwise identify potential problems that arise in the operation of a number of different client devices. At the time of such initial analysis, the analysis system may not know the characterization data for the devices, and may only seek such data from the security server system after identifying the problem. Such follow-up information gathering may then be used by the analysis system to identify features of the devices that are determined to be acting anomalously, such as by determining that they all are executing the same browser program, and perhaps a common version or range of versions of that program. In yet other embodiments, the analysis system may repeat operations that are performed by the security server system, such as in the inferring of the original content of compressed messages via compressing of received uncompressed messages.

Also, the security server system and the analysis server system may be part of the same system or separate systems. For example, a retailer may manage both systems along with a web server system. In another example, a third party may operate the analysis server system from its own facility, and can assist customers with operating their particular security server systems on their premises, with their web server systems. The third party may aggregate activity over a large amount of served content in such manner, and may more readily identify anomalous behavior than could a single organization serving only a fraction of such content.

FIGS. 5A and 5B show, respectively, state diagrams for a client and a server operating according to the mechanisms described above. Referring specifically to the client encoding state machine of FIG. 5A, at box 502, the client device begins the operations by which it prepares information for transmission to the server system and performs the transmission. At box 504, a determination is made whether more fields need to be encoded for transmission. If not, then the client waits until the next time that processing and transmission are needed.

If more fields need to be transmitted to the server, a random number (which may be pseudorandom or otherwise less than exactly random) is generated at box 506. The number may be generated for each overall transmission or for each field within a transmission (so that some field values may be compressed and some not). At box 508, the client determines whether the generated number exceeds a threshold. That threshold may be a predetermined value that is relatively permanent and stored by the client for a long time, or may be highly variable, where the threshold is transmitted with code recently received by the client, or is accessed at run time by the client (e.g., by submitting a GET request to a remote server system). If the generated number exceeds the threshold, then the client sends a raw version of the relevant field, such as in plaintext or losslessly compressed form of the content for the field (box 512). If the threshold is not exceeded, then a hashed version of the content is sent (box 510). Of course, the determination may be made inversely, so that the hashed form is sent if the threshold is exceeded (and/or matched), and the raw data is sent if it is not (and/or is matched).
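By way of a non-limiting illustration, the client-side determination of FIG. 5A (boxes 506 through 512) might be sketched as follows. The function name, the tagged return value, the uniform draw on the interval from 0 to 1, and the use of SHA-256 are assumptions made for this sketch and are not specified in this document.

```python
import hashlib
import random
from typing import Callable


def encode_field(value: str, threshold: float,
                 draw: Callable[[], float] = random.random) -> tuple[str, str]:
    """Decide whether one field is sent in raw or hashed form.

    `threshold` plays the role of the biasing value served with the web code.
    `draw` returns a number in [0, 1); it is injectable so the decision can be
    exercised deterministically.
    """
    number = draw()                      # box 506: generate a random number
    if number > threshold:               # box 508: compare to the threshold
        return ("raw", value)            # box 512: send the raw field
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return ("hash", digest)              # box 510: send the hashed field
```

With a threshold of 0.95 under these assumptions, roughly 5% of fields would be reported in raw form and the remainder in hashed form, which reflects the case in which the vast majority of reports are hashed.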

Referring now to FIG. 5B, there is shown a state diagram for a server that interacts with the operations of the client just described. At box 520, the process begins, such as by the server determining that it has received data for a plurality of fields, where the data needs to be interpreted by the server system. If there are no more fields to process, the system returns to a rest state, but if there are, then the server analyzes the next field in line and determines whether it is in raw form or hash form (box 524). If it is in raw form, then the server hashes the field using a hashing technique that matches a technique that the server knows to be performed by various clients that are reporting data to it (box 532). The server then associates the hash result with the raw content (box 534). The system can then use such a correlation between the hash result and the raw data to interpret other communications from other clients that contain only the hash result. In particular, the system can use the correlation to infer what the raw form at the client was when only the hash form is received.

If the field is not in raw form (i.e., is in hash form), the system performs a lookup on the hash form (box 526). The system determines whether the hash form of the field is found in the system (box 528), so as to indicate that a correlation has already been stored between the hash form and the raw form. If the hash form is found, then the system can get the raw value (box 530) and act accordingly. If the field is not found (e.g., because the value for the field has not previously been received in raw form), then the occurrence of the receipt of the hash form from the client may be saved and noted, and the system may return to check if additional fields need processing.
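By way of a non-limiting illustration, the server-side handling of one field in FIG. 5B might be sketched as follows. The class name, the attribute names, the "raw"/"hash" tags, and the use of SHA-256 are assumptions made for this sketch and are not specified in this document.

```python
import hashlib
from collections import defaultdict
from typing import Optional


def hash_text(text: str) -> str:
    # Must match the technique used by the reporting clients (assumed SHA-256).
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class HashCorrelator:
    """Correlates hashed field values with raw values seen from other clients."""

    def __init__(self) -> None:
        self.raw_by_hash: dict[str, str] = {}
        self.devices_by_hash: defaultdict[str, set[str]] = defaultdict(set)

    def process_field(self, device_id: str, form: str,
                      payload: str) -> Optional[str]:
        """Process one field; return its raw value if known, else None."""
        if form == "raw":
            digest = hash_text(payload)          # box 532: hash the raw field
            self.raw_by_hash[digest] = payload   # box 534: store correlation
        else:
            digest = payload                     # field arrived in hash form
        # Note which device reported this value, whether or not it is resolved.
        self.devices_by_hash[digest].add(device_id)
        return self.raw_by_hash.get(digest)      # boxes 526-530: lookup
```

Because every report is recorded under its hash value, devices that reported a hash before any raw version arrived become interpretable as soon as one raw report for the same value is processed.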

FIG. 6 is a schematic diagram of a computer system 600. The system 600 can be used for the operations described in association with any of the computer-implemented methods described previously, according to one implementation. The system 600 is intended to include various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The system 600 can also include mobile devices, such as personal digital assistants, cellular telephones, smartphones, and other similar computing devices. Additionally, the system can include portable storage media, such as Universal Serial Bus (USB) flash drives. For example, the USB flash drives may store operating systems and other applications. The USB flash drives can include input/output components, such as a wireless transmitter or USB connector that may be inserted into a USB port of another computing device.

The system 600 includes a processor 610, a memory 620, a storage device 630, and an input/output device 640. Each of the components 610, 620, 630, and 640 is interconnected using a system bus 650. The processor 610 is capable of processing instructions for execution within the system 600. The processor may be designed using any of a number of architectures. For example, the processor 610 may be a CISC (Complex Instruction Set Computer) processor, a RISC (Reduced Instruction Set Computer) processor, or a MISC (Minimal Instruction Set Computer) processor.

In one implementation, the processor 610 is a single-threaded processor. In another implementation, the processor 610 is a multi-threaded processor. The processor 610 is capable of processing instructions stored in the memory 620 or on the storage device 630 to display graphical information for a user interface on the input/output device 640.

The memory 620 stores information within the system 600. In one implementation, the memory 620 is a computer-readable medium. In one implementation, the memory 620 is a volatile memory unit. In another implementation, the memory 620 is a non-volatile memory unit.

The storage device 630 is capable of providing mass storage for the system 600. In one implementation, the storage device 630 is a computer-readable medium. In various different implementations, the storage device 630 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device.

The input/output device 640 provides input/output operations for the system 600. In one implementation, the input/output device 640 includes a keyboard and/or pointing device. In another implementation, the input/output device 640 includes a display unit for displaying graphical user interfaces.

The features described can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The apparatus can be implemented in a computer program product tangibly embodied in an information carrier, e.g., in a machine-readable storage device for execution by a programmable processor; and method steps can be performed by a programmable processor executing a program of instructions to perform functions of the described implementations by operating on input data and generating output. The described features can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program is a set of instructions that can be used, directly or indirectly, in a computer to perform a certain activity or bring about a certain result. A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

Suitable processors for the execution of a program of instructions include, by way of example, both general and special purpose microprocessors, and the sole processor or one of multiple processors of any kind of computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memories for storing instructions and data. Generally, a computer will also include, or be operatively coupled to communicate with, one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).

To provide for interaction with a user, the features can be implemented on a computer having a display device such as a CRT (cathode ray tube) or LCD (liquid crystal display) monitor for displaying information to the user and a keyboard and a pointing device such as a mouse or a trackball by which the user can provide input to the computer. Additionally, such activities can be implemented via touchscreen flat-panel displays and other appropriate mechanisms.

The features can be implemented in a computer system that includes a back-end component, such as a data server, or that includes a middleware component, such as an application server or an Internet server, or that includes a front-end component, such as a client computer having a graphical user interface or an Internet browser, or any combination of them. The components of the system can be connected by any form or medium of digital data communication such as a communication network. Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), peer-to-peer networks (having ad-hoc or static members), grid computing infrastructures, and the Internet.

The computer system can include clients and servers. A client and server are generally remote from each other and typically interact through a network, such as the described one. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as descriptions of features specific to particular implementations of particular inventions. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the implementations described above should not be understood as requiring such separation in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular implementations of the subject matter have been described. Other implementations are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

What is claimed is:
 1. A computer-implemented method, comprising: serving, from a computer server system and to a plurality of different computing devices remote from the computer server system, web code that has been recoded to obscure its operation from malware that may be operating on the different computing devices; receiving from different ones of the computing devices, an obfuscated representation of a particular parameter for a first of the computing devices, and a nonobfuscated representation of the same parameter for a second of the computing devices; obfuscating the unobfuscated representation of the particular parameter, and comparing the obfuscated representation for the second of the computing devices with the obfuscated representation for the first of the computing devices; and based on a determination that the obfuscated representations correspond to each other, correlating the obfuscated representation to the unobfuscated representation on the computer server system, wherein the code for reporting parameters of the computing devices includes code for allowing the computing devices to determine whether to send an obfuscated representation or an unobfuscated representation.
 2. The computer-implemented method of claim 1, wherein the code for allowing the computing devices to determine whether to send an obfuscated representation or an unobfuscated representation comprises biasing data that affects a frequency with which the computing devices select to send the plaintext representation or the hashed representation.
 3. The computer-implemented method of claim 1, further comprising: receiving from the computing devices, unobfuscated representations and obfuscated representations of a plurality of different parameters of the computing devices; obfuscating the received unobfuscated representations to create obfuscated values; and using correlations between the obfuscated values and the received unobfuscated representations to identify parameters represented by the obfuscated representations.
 4. The computer-implemented method of claim 1, further comprising using the obfuscated representation and the unobfuscated representation to identify characteristics of malware executing on the computing devices.
 5. A computer-implemented method, comprising: serving, from a computer server system and to a plurality of different computing devices remote from the computer server system, web code and code for reporting status of the computing devices; receiving from one or more of the computing devices, first data that indicates a parameter of the one or more computing devices, the first data in a compressed format; receiving from one or more others of the computing devices, second data that indicates the parameter of the one or more others of the computing devices, the second data in an uncompressed format; and compressing the second data and comparing the compressed second data to the first data to correlate the first data to the second data, wherein the code for reporting status of the computing devices includes code for allowing the computing devices to determine whether to send the first data or the second data.
 6. The computer-implemented method of claim 5, wherein the code for allowing the computing devices to determine whether to send the first data or the second data comprises biasing data that affects a frequency with which the computing devices select to send the first data or the second data.
 7. The computer-implemented method of claim 5, wherein the first data is compressed on the computing devices using hashing.
 8. The computer-implemented method of claim 7, wherein the server system does not send hashing algorithm information to the computing devices.
 9. The computer-implemented method of claim 5, further comprising using the compressed format to represent the parameter in identifying aggregate activity by multiple of the computing devices.
 10. The computer-implemented method of claim 9, further comprising determining from the aggregate activity by multiple of the computer devices whether ones of the multiple computing devices is infected with malware.
 11. The computer-implemented method of claim 5, wherein the computer server system comprises an intermediary security server system that is separate from a web server system that generates and serves the web code.
 12. The computer-implemented method of claim 5, further comprising comparing information sent with the compressed second data to information derived from the received first data to determine whether the compressed second data was generated from data that matches the first data.
 13. One or more non-transitory storage devices storing instructions that, when executed by one or more computer processors, perform operations comprising: serving, from a computer server system and to a plurality of different computing devices remote from the computer server system, web code and code for reporting status of the computing devices; receiving from one or more of the computing devices, first data that indicates a parameter of the one or more computing devices, the first data in a compressed format; receiving from one or more others of the computing devices, second data that indicates the parameter of the one or more others of the computing devices, the second data in an uncompressed format; and compressing the second data and comparing the compressed second data to the first data to correlate the first data to the second data, wherein the code for reporting status of the computing devices includes code for allowing the computing devices to determine whether to send the first data or the second data.
 14. The one or more non-transitory storage devices of claim 13, wherein the code for allowing the computing devices to determine whether to send the first data or the second data comprises biasing data that affects a frequency with which the computing devices select to send the first data or the second data.
 15. The one or more non-transitory storage devices of claim 13, wherein the first data is compressed on the computing devices using hashing.
 16. The one or more non-transitory storage devices of claim 13, wherein the operations further comprise using the compressed format to represent the parameter in identifying aggregate activity by multiple of the computing devices.
 17. The one or more non-transitory storage devices of claim 16, wherein the operations further comprise determining from the aggregate activity by multiple of the computer devices whether ones of the multiple computing devices is infected with malware.
 18. The one or more non-transitory storage devices of claim 13, wherein the computer server system comprises an intermediary security server system that is separate from a web server system that generates and serves the web code.
 19. The one or more non-transitory storage devices of claim 13, wherein the operations further comprise comparing information sent with the compressed second data to information derived from the received first data to determine whether the compressed second data was generated from data that matches the first data.